Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery. [https://machinelearningmastery.com/]
SUMMARY: This project aims to construct a predictive model using various machine learning algorithms and document the end-to-end steps using a template. The Sign Language MNIST dataset is a multi-class classification situation where we attempt to predict one of several (more than two) possible outcomes.
INTRODUCTION: The original MNIST image dataset of handwritten digits is a popular benchmark for image-based machine learning methods but researchers have renewed efforts to update it and develop drop-in replacements that are more challenging for computer vision and original for real-world applications. To stimulate the community to develop more drop-in replacements, the Sign Language MNIST is presented here and follows the same CSV format with labels and pixel values in single rows. The American Sign Language letter database of hand gestures represent a multi-class problem with 24 classes of letters (excluding J and Z which require motion).
The dataset format is patterned to match closely with the classic MNIST. Each training and test case represents a label (0-25) as a one-to-one map for each alphabetic letter A-Z (and no cases for 9=J or 25=Z because of gesture motions). The training data (27,455 cases) and test data (7172 cases) are approximately half the size of the standard MNIST but otherwise similar with a header row of label, pixel1,pixel2….pixel784 which represent a single 28x28 pixel image with grayscale values between 0-255. The original hand gesture image data represented multiple users repeating the gesture against different backgrounds.
ANALYSIS: The performance of the preliminary XGBoost model achieved an accuracy benchmark of 95.41%. After a series of tuning trials, the best XGBoost model processed the training dataset with an accuracy score of 99.68%. When we applied the final model to the previously unseen test dataset, we obtained an accuracy score of 78.93%, which pointed to a high variance error.
CONCLUSION: In this iteration, XGBoost appeared to be a suitable algorithm for modeling this dataset. We should consider experimenting with XGBoost for further modeling.
Dataset Used: Sign Language MNIST Data Set
Dataset ML Model: Multi-class with numerical attributes
Dataset Reference: https://www.kaggle.com/datamunge/sign-language-mnist
One source of potential performance benchmarks: https://www.kaggle.com/datamunge/sign-language-mnist
Any predictive modeling machine learning project generally can be broken down into about six major tasks:
# Install the necessary packages for Colab
!pip install python-dotenv PyMySQL
# Retrieve the GPU information from Colab
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
print('Select the Runtime → "Change runtime type" menu to enable a GPU accelerator, ')
print('and then re-execute this cell.')
else:
print(gpu_info)
# Retrieve the memory configuration from Colab
from psutil import virtual_memory
ram_gb = virtual_memory().total / 1e9
print('Your runtime has {:.1f} gigabytes of available RAM\n'.format(ram_gb))
if ram_gb < 20:
print('To enable a high-RAM runtime, select the Runtime → "Change runtime type"')
print('menu, and then select High-RAM in the Runtime shape dropdown. Then, ')
print('re-execute this cell.')
else:
print('You are using a high-RAM runtime!')
# Retrieve the CPU information
ncpu = !nproc
print("The number of available CPUs is:", ncpu[0])
# Set the random seed number for reproducible results
seedNum = 888
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os
import sys
# import boto3
from datetime import datetime
from dotenv import load_dotenv
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
# from sklearn.impute import SimpleImputer
# from sklearn.feature_selection import RFE
# from imblearn.pipeline import Pipeline
# from imblearn.over_sampling import SMOTE
# from imblearn.under_sampling import RandomUnderSampler
from xgboost import XGBClassifier
# Begin the timer for the script processing
startTimeScript = datetime.now()
# Set up the number of CPU cores available for multi-thread processing
n_jobs = 1
# Set up the flag to stop sending progress emails (setting to True will send status emails!)
notifyStatus = False
# Set Pandas options
pd.set_option("display.max_rows", 500)
pd.set_option("display.width", 140)
# Set the percentage sizes for splitting the dataset
test_set_size = 0.2
val_set_size = 0.25
# Set the number of folds for cross validation
n_folds = 5
# Set various default modeling parameters
scoring = 'accuracy'
# Set the number of classes for XGBoost
n_classes = 3
# Set up the parent directory location for loading the dotenv files
# useColab = True
# if useColab:
# # Mount Google Drive locally for storing files
# from google.colab import drive
# drive.mount('/content/gdrive')
# gdrivePrefix = '/content/gdrive/My Drive/Colab_Downloads/'
# env_path = '/content/gdrive/My Drive/Colab Notebooks/'
# dotenv_path = env_path + "python_script.env"
# load_dotenv(dotenv_path=dotenv_path)
# Set up the dotenv file for retrieving environment variables
# useLocalPC = True
# if useLocalPC:
# env_path = "/Users/david/PycharmProjects/"
# dotenv_path = env_path + "python_script.env"
# load_dotenv(dotenv_path=dotenv_path)
# Set up the email notification function
def status_notify(msg_text):
access_key = os.environ.get('SNS_ACCESS_KEY')
secret_key = os.environ.get('SNS_SECRET_KEY')
aws_region = os.environ.get('SNS_AWS_REGION')
topic_arn = os.environ.get('SNS_TOPIC_ARN')
if (access_key is None) or (secret_key is None) or (aws_region is None):
sys.exit("Incomplete notification setup info. Script Processing Aborted!!!")
sns = boto3.client('sns', aws_access_key_id=access_key, aws_secret_access_key=secret_key, region_name=aws_region)
response = sns.publish(TopicArn=topic_arn, Message=msg_text)
if response['ResponseMetadata']['HTTPStatusCode'] != 200 :
print('Status notification not OK with HTTP status code:', response['ResponseMetadata']['HTTPStatusCode'])
if notifyStatus: status_notify("Task 1 - Prepare Environment has begun! " + datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))
dataset_path = 'https://dainesanalytics.com/datasets/kaggle-sign-language-mnist/sign_mnist_train.csv'
Xy_original = pd.read_csv(dataset_path, sep=',')
# Take a peek at the dataframe after import
Xy_original.head()
Xy_original.info(verbose=True)
Xy_original.describe()
Xy_original.isnull().sum()
print('Total number of NaN in the dataframe: ', Xy_original.isnull().sum().sum())
# Standardize the class column to the name of targetVar if required
Xy_original = Xy_original.rename(columns={'label': 'targetVar'})
# Use variable totCol to hold the number of columns in the dataframe
totCol = len(Xy_original.columns)
# Set up variable totAttr for the total number of attribute columns
totAttr = totCol-1
# targetCol variable indicates the column location of the target/class variable
# If the first column, set targetCol to 1. If the last column, set targetCol to totCol
# If (targetCol <> 1) and (targetCol <> totCol), be aware when slicing up the dataframes for visualization
targetCol = 1
# We create attribute-only and target-only datasets (X_original and y_original)
# for various visualization and cleaning/transformation operations
if targetCol == totCol:
X_original = Xy_original.iloc[:,0:totAttr]
y_original = Xy_original.iloc[:,totAttr]
else:
X_original = Xy_original.iloc[:,1:totCol]
y_original = Xy_original.iloc[:,0]
print("Xy_original.shape: {} X_original.shape: {} y_original.shape: {}".format(Xy_original.shape, X_original.shape, y_original.shape))
# Split the data further into training and validation datasets
X_train_df = X_original.copy()
y_train_df = y_original.copy()
print("X_train_df.shape: {} y_train_df.shape: {}".format(X_train_df.shape, y_train_df.shape))
if notifyStatus: status_notify("Task 1 - Prepare Environment completed! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))
if notifyStatus: status_notify("Task 2 - Summarize and Visualize Data has begun! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))
# Set up the number of row and columns for visualization display. dispRow * dispCol should be >= totAttr
dispCol = 4
if totAttr % dispCol == 0 :
dispRow = totAttr // dispCol
else :
dispRow = (totAttr // dispCol) + 1
# Set figure width to display the data visualization plots
fig_size = plt.rcParams["figure.figsize"]
fig_size[0] = dispCol*4
fig_size[1] = dispRow*4
plt.rcParams["figure.figsize"] = fig_size
# Histograms for each attribute
X_train_df.hist(layout=(dispRow,dispCol))
plt.show()
# Box and Whisker plot for each attribute
X_train_df.plot(kind='box', subplots=True, layout=(dispRow,dispCol))
plt.show()
# Correlation matrix
fig = plt.figure(figsize=(16,12))
ax = fig.add_subplot(111)
correlations = X_train_df.corr(method='pearson')
cax = ax.matshow(correlations, vmin=-1, vmax=1)
fig.colorbar(cax)
plt.show()
if notifyStatus: status_notify("Task 2 - Summarize and Visualize Data completed! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))
if notifyStatus: status_notify("Task 3 - Pre-process Data has begun! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))
# Compose pipeline for the numerical and categorical features
numeric_columns = X_train_df.select_dtypes(include=['int','float']).columns
numeric_transformer = Pipeline(steps=[
# ('imputer', SimpleImputer(strategy='constant', fill_value=0)),
('scaler', preprocessing.StandardScaler())
])
categorical_columns = X_train_df.select_dtypes(include=['object','category']).columns
categorical_transformer = Pipeline(steps=[
# ('imputer', SimpleImputer(strategy='constant', fill_value='NA')),
('onehot', preprocessing.OneHotEncoder(sparse=False, handle_unknown='ignore'))
])
preprocessor = ColumnTransformer(transformers=[
('num', numeric_transformer, numeric_columns)
# ('cat', categorical_transformer, categorical_columns)
])
print("Number of numerical columns:", len(numeric_columns))
print("Number of categorical columns:", len(categorical_columns))
print("Total number of columns in the dataframe:", X_train_df.shape[1])
# Apply the pre-processing pipeline to the training dataset
X_train = preprocessor.fit_transform(X_train_df)
print("Transformed X_train.shape:", X_train.shape)
print(X_train)
# Not applicable for this iteration of the project
# Not applicable for this iteration of the project
# Apply the encoder to the training dataset
class_encoder = preprocessing.LabelEncoder()
y_train = class_encoder.fit_transform(y_train_df)
print("X_train.shape: {} y_train.shape: {}".format(X_train.shape, y_train.shape))
if notifyStatus: status_notify("Task 3 - Pre-process Data completed! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))
if notifyStatus: status_notify("Task 4 - Train and Evaluate Models has begun! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))
# Set up Algorithms Spot-Checking Array
startTimeModule = datetime.now()
# train_models = [('XGB', XGBClassifier(random_state=seedNum, n_jobs=n_jobs, objective='multi:softmax', num_class=n_classes))]
train_models = [('XGB', XGBClassifier(random_state=seedNum, n_jobs=n_jobs, objective='multi:softmax', num_class=n_classes, tree_method='gpu_hist'))]
# Generate model in turn
for name, model in train_models:
if notifyStatus: status_notify("Algorithm "+name+" modeling has begun! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))
startTimeModule = datetime.now()
kfold = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=seedNum)
cv_results = cross_val_score(model, X_train, y_train, cv=kfold, scoring=scoring, n_jobs=n_jobs, verbose=1)
print("%s: %f (%f)" % (name, cv_results.mean(), cv_results.std()))
print(model)
print ('Model training time:', (datetime.now() - startTimeModule), '\n')
if notifyStatus: status_notify("Algorithm "+name+" modeling completed! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))
# Set up the comparison array
tune_results = []
tune_model_names = []
# Tuning XGBoost n_estimators, max_depth, and min_child_weight parameters
startTimeModule = datetime.now()
if notifyStatus: status_notify("Algorithm tuning iteration #1 has begun! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))
# tune_model1 = XGBClassifier(random_state=seedNum, n_jobs=n_jobs, objective='multi:softmax', num_class=n_classes)
tune_model1 = XGBClassifier(random_state=seedNum, n_jobs=n_jobs, objective='multi:softmax', num_class=n_classes, tree_method='gpu_hist')
tune_model_names.append('XGB_1')
paramGrid1 = dict(n_estimators=range(500,901,100), max_depth=np.array([3,6]), min_child_weight=np.array([1,2]))
kfold = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=seedNum)
grid1 = GridSearchCV(estimator=tune_model1, param_grid=paramGrid1, scoring=scoring, cv=kfold, n_jobs=n_jobs, verbose=1)
grid_result1 = grid1.fit(X_train, y_train)
print("Best: %f using %s" % (grid_result1.best_score_, grid_result1.best_params_))
tune_results.append(grid_result1.cv_results_['mean_test_score'])
means = grid_result1.cv_results_['mean_test_score']
stds = grid_result1.cv_results_['std_test_score']
params = grid_result1.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
print("%f (%f) with: %r" % (mean, stdev, param))
print ('Model training time:',(datetime.now() - startTimeModule))
if notifyStatus: status_notify("Algorithm tuning iteration #1 completed! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))
# Tuning XGBoost subsample and colsample_bytree parameters
startTimeModule = datetime.now()
if notifyStatus: status_notify("Algorithm tuning iteration #2 has begun! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))
# tune_model2 = XGBClassifier(n_estimators=100, max_depth=6, min_child_weight=1, random_state=seedNum, n_jobs=n_jobs, objective='multi:softmax', num_class=n_classes)
tune_model2 = XGBClassifier(n_estimators=800, max_depth=3, min_child_weight=1, random_state=seedNum, n_jobs=n_jobs, objective='multi:softmax', num_class=n_classes, tree_method='gpu_hist')
tune_model_names.append('XGB_2')
paramGrid2 = dict(subsample=np.array([0.7,0.8,0.9,1.0]), colsample_bytree=np.array([0.7,0.8,0.9,1.0]))
kfold = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=seedNum)
grid2 = GridSearchCV(estimator=tune_model2, param_grid=paramGrid2, scoring=scoring, cv=kfold, n_jobs=n_jobs, verbose=1)
grid_result2 = grid2.fit(X_train, y_train)
print("Best: %f using %s" % (grid_result2.best_score_, grid_result2.best_params_))
tune_results.append(grid_result2.cv_results_['mean_test_score'])
means = grid_result2.cv_results_['mean_test_score']
stds = grid_result2.cv_results_['std_test_score']
params = grid_result2.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
print("%f (%f) with: %r" % (mean, stdev, param))
print ('Model training time:',(datetime.now() - startTimeModule))
if notifyStatus: status_notify("Algorithm tuning iteration #2 completed! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))
fig = plt.figure(figsize=(16,12))
fig.suptitle('Algorithm Comparison - Post Tuning')
ax = fig.add_subplot(111)
plt.boxplot(tune_results)
ax.set_xticklabels(tune_model_names)
plt.show()
if notifyStatus: status_notify("Task 4 - Train and Evaluate Models completed! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))
if notifyStatus: status_notify("Task 5 - Finalize Model and Present Analysis has begun! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))
dataset_path = 'https://dainesanalytics.com/datasets/kaggle-sign-language-mnist/sign_mnist_test.csv'
Xy_test = pd.read_csv(dataset_path, sep=',')
# Take a peek at the dataframe after import
Xy_test.head()
# Standardize the class column to the name of targetVar if required
Xy_test = Xy_test.rename(columns={'label': 'targetVar'})
X_test_df = Xy_test.iloc[:,1:totCol]
y_test_df = Xy_test.iloc[:,0]
print("Xy_test.shape: {} X_test_df.shape: {} y_test_df.shape: {}".format(Xy_test.shape, X_test_df.shape, y_test_df.shape))
# Apply the pre-processing pipeline to the test dataset
X_test = preprocessor.transform(X_test_df)
print("Transformed X_test.shape:", X_test.shape)
print(X_test)
# Apply the encoder to the test dataset
y_test = class_encoder.transform(y_test_df)
print("X_test.shape: {} y_test.shape: {}".format(X_test.shape, y_test.shape))
# test_model = XGBClassifier(n_estimators=100, max_depth=6, min_child_weight=1, colsample_bytree=0.8, subsample=0.8, random_state=seedNum, n_jobs=n_jobs, objective='multi:softmax')
test_model = XGBClassifier(n_estimators=800, max_depth=3, min_child_weight=1, colsample_bytree=0.8, subsample=0.7, random_state=seedNum, n_jobs=n_jobs, objective='multi:softmax', tree_method='gpu_hist')
test_model.fit(X_train, y_train)
print(test_model)
test_predictions = test_model.predict(X_test)
print('Accuracy Score:', accuracy_score(y_test, test_predictions))
print(confusion_matrix(y_test, test_predictions))
print(classification_report(y_test, test_predictions))
if notifyStatus: status_notify("Task 5 - Finalize Model and Present Analysis completed! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))
print ('Total time for the script:',(datetime.now() - startTimeScript))